In this analysis we’ll be looking at cells undergoing neurogenesis in the developing human frontal cortex. The data are taken from Nowakowski et al (2017). Our goal is to build a simple method for fitting single cell gene regulatory networks (GRNs) and compare the results across pseudotime lineages, keeping in mind the trajectory structure of the data. If you want to skip straight to the analysis, head to Section 6.
2 Libraries
We start by loading in a few necessary packages & resolving a few function conflicts.
Unfortunately the data that are made available as part of the scRNAseq package come in a transcripts-per-million (TPM) normalized matrix - not raw counts. A rough conversion back to the raw integer counts is to multiply each cell’s normalized expression vector by the TPM size factor. This gets us most of the way there but the results are still estimates, so we round them to the nearest integer.
Note
This conversion ignores gene length, which is why the results are estimates & require rounding. For a more detailed discussion on how to un-normalize TPM counts see this StackOverflow post. While the original authors used a plate-based protocol (Fluidigm C1) they do not mention whether they used unique molecular identifiers (UMIs); if UMIs were used then gene length normalization isn’t necessary. Since simply reversing the TPM transformation doesn’t result in integer values I would guess that UMIs were not used, though their Supplementary Methods don’t mention gene length normalization.
We run the cells through a basic preprocessing pipeline: QC, normalization, highly variable gene (HVG) identification, PCA, UMAP, and graph-based Leiden clustering.
Here we examine some of our dataset’s attributes - the original celltype annotations, our new clusters, age in post-conception weeks (PCW), and the proportion of each cell’s reads aligning to ribosomal genes.
Figure 1: Characteristics of the cortical development dataset
6 Analysis
6.1 Celltype annotation
We begin by running a basic differential expression (DE) testing routine to identify some potential marker genes for each cluster. This will help us assign a celltype identity to each cluster.
We visualize the markers for each cluster with a dotplot, which shows clear differences between clusters. For example, the transcription factor (TF) aristaless related homeobox (ARX) is specifically expressed in cluster 7; this TF’s expression is known to be necessary for healthy neuronal development and helps to regulate differentiation.
The markers seem consistent with what we’d expect for excitatory neurons; for more information see e.g., Figure 4 and Suppl. Figures 4 & 11 in the original publication.
Table 1: The top 3 markers genes for each identified celltype
6.2 Pseudotime estimation
After estimating principal curves across our UMAP embedding with slingshot, we generate per-lineage pseudotime values for each cell. We use our clustering as a source of structure in the data, and assign start and end clusters based on age in PCW. In addition, we mean-aggregate the pseudotime values for each cell in order to get a total pseudotime which represents overall differentiation regardless of lineage.
The graph structure estimate by slingshot is called a minimum spanning tree (MST), and describes the relationships between clusters in an undirected manner.
Code
p7 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, group = Lineage), linewidth =1.25, color ="black") +geom_point(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, fill = Cluster), color ="black", shape =21, size =4.5, stroke =1.25, show.legend =FALSE) +scale_color_manual(values = palette_cluster) +scale_fill_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p7
Figure 3: UMAP embedding with MST from Slinghot overlaid
Since our data have real-life experimental timepoints, it makes sense to normalize the pseudotime within each lineage to \([0, 1]\).
Figure 4: UMAP embedding colored by per-lineage pseudotime
The principal curves show a trifurcating lineage structure.
Code
p9 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_curves,mapping =aes(x = umap_1, y = umap_2, group = Lineage), color ="black", linewidth =1.5, alpha =0.75, lineend ="round") +scale_color_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guides(color =guide_legend(override.aes =list(size =4, alpha =1, stroke =0.25)))p9
Figure 5: UMAP embedding with principal curves from Slinghot overlaid in black
Lastly, we plot the distribution of mean-aggregated pseudotime per-timepoint and see that later timepoints have, in general, larger values of overall pseudotime. This indicates that our estimated pseudotime is linked to real experimental time in a meaningful (though fairly noisy) way.
Code
p10 <-select(seu_cortex@meta.data, Age_in_Weeks, PT_Overall) %>%mutate(Age_in_Weeks =as.factor(Age_in_Weeks)) %>%ggplot(aes(x = Age_in_Weeks, y = PT_Overall, color = Age_in_Weeks)) + ggbeeswarm::geom_quasirandom(size =1.25, alpha =0.75, show.legend =FALSE) +stat_summary(geom ="point", fun ="mean", color ="black",size =3) +scale_color_manual(values = palette_timepoint) +labs(x ="Age (PCW)", y ="Mean Psueodtime") +theme_scLANE()p10
Figure 6: Beeswarm plot displaying the distribution of mean-aggregated pseudotime for each timepoint
6.3 Trajectory DE testing
Using scLANE (see the GitHub repository or the preprint for details) we perform trajectory DE significance testing for each HVG across each lineage.
The GRN has the following structure - for every target gene, a penalized model is fit with the dynamics of every possible TF as features (see below). The model’s internal feature selection identifies a subset of TFs whose dynamics are predictive of the target gene’s dynamics. See e.g., the original XGBoost paper for details.
Our GRNs will be lineage-specific, since the matrix of gene dynamics differs by lineage. First we fit the GRN for lineage A; I’ve opted to use the LightGBM package to fit the models as it’s lightweight and in general much faster than other options such as XGBoost. In addition, I prefer this method to alternative penalized methods such as LASSO - boosted trees better capture nonlinearities in the data, though LASSO does offer advantages such as significance testing for coefficients and linear interpretability. Both methods have built-in cross-validation (CV) routines (lightgbm::lgb.cv() and glmnet::cv.glmnet(), respectively); here we use 5-fold CV to select the best set of hyperparameters before training the final model.
Note
Since we’re using the foreach and doSNOW packages to loop over the set of target genes in parallel, we need to ensure that our LightGBM model only uses a single thread when fitting the model. This is done by setting the parameters tree_learner = "serial" and num_threads = 1L in the model parameters list.
Next we add the Spearman correlation between TF and target gene dynamics, as well as the normalized gene expression correlation. We denote the relationship between TF and target gene as repression if the correlation between their dynamics is negative, and activation if the correlation is positive.
Warning
Usually the correlations between TF and target dynamics and TF and target expression are similar, or at least have the same directionality (positive or negative). However, when expression patterns are noisy, the correlation directions might differ (see Table 4 for examples). In this case, we’re opting to trust the dynamics more, as they are a smoothed version of expression - though an argument could be made for the opposite stance as well.
The output is shown below. Here we use the frequency with which the TF is included in the final boosted tree model as the feature importance, which can be interpreted as the strength of the predictive relationship between TF & target gene. See e.g., this StackExchange post for more detail.
We repeat the fitting process for the other two lineages (expand this code block to see the details). Skip to Section 7 if you want a version of this logic that’s been wrapped into a function!
One TF of particular interest is SRY-box transcription factor 4 (SOX4). This TF helps to regulate development and is involved in the specification of cell fate, so a comparison of its behavior across lineages should be pretty neat. We use a Fruchterman-Reingold layout to embed the graphs in 2D space via the layout_with_fr() function from igraph. We see that not only do the top target genes differ widely by lineage, so too do the directions of the relationships (activation vs. repression), and the strengths of those relationships. For example, in lineage A the top target gene is BACE2, whereas for lineage B it’s LINC01268.
Figure 7: Fruchterman-Reingold embeddings of the relationships between SRY-box transcription factor 4 and lineage-specific target genes
7 Conclusions
Hopefully this analysis has demonstrated the utility of trajectory GRNs for comparing gene dynamics across lineages. Beyond just visual comparison of how gene expression changes over pseudotime, trajectory GRNs allow us to identify which transcription factors are activating or suppressing target genes, how top TF targets differ between lineages, and the ways in which TF-target relationships change based on the cell fate towards which cells are progressing. To conclude, I’ve written up the methodology for creating a trajectory GRN into a function, which is shown in the code block below. It includes functionality for a progress bar if desired, and takes as input a matrix of gene dynamics (expr.mat), a set of dynamic genes (dyn.genes), and a set of transcription factors (tx.factors). Eventually I plan to expand this functionality and turn it into an R package, but for now it’ll stay a simple function.
---title: "Building a Trajectory Regulatory Network with scRNA-seq Data"author: name: Jack Leary email: j.leary@ufl.edu orcid: 0009-0004-8821-3269 affiliations: - name: University of Florida department: Biostatistics city: Gainesville state: FLdate: todaydate-format: longformat: html: code-fold: show code-copy: true code-tools: true toc: true toc-depth: 2 embed-resources: true fig-format: retina fig-width: 9 fig-height: 6 df-print: kable link-external-newwindow: true tbl-cap-location: bottom number-sections: trueexecute: cache: false freeze: auto---```{r setup}#| include: falseknitr::opts_chunk$set(comment = NA)reticulate::use_virtualenv("~/Desktop/PhD/Research/Python_Envs/personal_site/", required = TRUE)set.seed(312)```# Introduction {#sec-intro}In this analysis we'll be looking at cells undergoing neurogenesis in the developing human frontal cortex. The data are taken from [Nowakowski *et al* (2017)](https://doi.org/10.1126/science.aap8809). Our goal is to build a simple method for fitting single cell gene regulatory networks (GRNs) and compare the results across pseudotime lineages, keeping in mind the trajectory structure of the data. If you want to skip straight to the analysis, head to @sec-analysis. # Libraries {#sec-libs}We start by loading in a few necessary packages & resolving a few function conflicts. ```{r}#| message: false#| warning: falselibrary(dplyr) # data manipulationlibrary(Seurat) # scRNA-seq toolslibrary(scLANE) # trajectory DElibrary(igraph) # graph utilitieslibrary(ggplot2) # pretty plotslibrary(biomaRt) # gene annotationlibrary(foreach) # parallel loopslibrary(ggnetwork) # plotting graphslibrary(slingshot) # pseudotime estimationlibrary(patchwork) # plot alignmentlibrary(SingleCellExperiment) # scRNA-seq data structuresrename <- dplyr::renameselect <- dplyr::select```# Helper functions {#sec-fns}We define a helper function to make our legends look prettier. ```{r}guide_umap <-function(key.size =4) { ggplot2::guides(color = ggplot2::guide_legend(override.aes =list(size = key.size, alpha =1, stroke =0.25)))}```Next, we assign a few consistent color palettes that we'll use throughout. ```{r}palette_heatmap <- paletteer::paletteer_d("MetBrewer::Hiroshige", direction =-1)palette_celltype <- paletteer::paletteer_d("ggsci::category10_d3")palette_cluster <- paletteer::paletteer_d("ggsci::default_locuszoom")palette_timepoint <- paletteer::paletteer_d("ggsci::default_igv")```# Data {#sec-data}Unfortunately the data that are made available as part of the `scRNAseq` package come in a transcripts-per-million (TPM) normalized matrix - not raw counts. A rough conversion back to the raw integer counts is to multiply each cell's normalized expression vector by the TPM size factor. This gets us most of the way there but the results are still estimates, so we round them to the nearest integer. ::: {.callout-note}This conversion ignores gene length, which is why the results are estimates & require rounding. For a more detailed discussion on how to un-normalize TPM counts see [this StackOverflow post](https://stackoverflow.com/questions/60614972/converting-tpm-data-to-read-counts-for-seurat). While the original authors used a plate-based protocol (Fluidigm C1) they do not mention whether they used unique molecular identifiers (UMIs); if UMIs were used then gene length normalization isn't necessary. Since simply reversing the TPM transformation doesn't result in integer values I would guess that UMIs were not used, though [their Supplementary Methods](https://www.science.org/doi/10.1126/science.aap8809#supplementary-materials) don't mention gene length normalization. :::```{r}#| results: hide#| message: false#| warning: falsesce_cortex <- scRNAseq::NowakowskiCortexData()counts_matrix <- Matrix::t(sce_cortex@assays@data$tpm)scale_factors <- Matrix::rowSums(counts_matrix) /1e6counts_matrix <-round(Matrix::t(counts_matrix) * scale_factors)seu_cortex <-CreateSeuratObject(counts_matrix, assay ="RNA", meta.data =as.data.frame(colData(sce_cortex)), project ="cortex", min.cells =5,min.features =0)excitatory_neuron_celltypes <-c("nEN-late", "nEN-early1", "nEN-early2","EN-V1-1", "EN-V1-2", "EN-V1-3", "EN-PFC1", "EN-PFC2", "EN-PFC3")seu_cortex <- seu_cortex[, seu_cortex$WGCNAcluster %in% excitatory_neuron_celltypes]seu_cortex@meta.data <-mutate(seu_cortex@meta.data, celltype_original =case_when(WGCNAcluster =="nEN-early1"~"Early Exc. Neuron 1", WGCNAcluster =="nEN-early2"~"Early Exc. Neuron 2", WGCNAcluster =="nEN-late"~"Late Exc. Neuron", WGCNAcluster =="EN-V1-1"~"Exc. Neuron 1", WGCNAcluster =="EN-V1-2"~"Exc. Neuron 2", WGCNAcluster =="EN-V1-3"~"Exc. Neuron 3", WGCNAcluster =="EN-PFC1"~"PFC Exc. Neuron 1", WGCNAcluster =="EN-PFC2"~"PFC Exc. Neuron 2", WGCNAcluster =="EN-PFC3"~"PFC Exc. Neuron 3", TRUE~NA_character_))```# Preprocessing {#sec-proprocess}We run the cells through a basic preprocessing pipeline: QC, normalization, highly variable gene (HVG) identification, PCA, UMAP, and graph-based Leiden clustering. ```{r}#| message: false #| warning: falseseu_cortex <- seu_cortex %>%PercentageFeatureSet(pattern ="^MT-", col.name ="percent_mito") %>%PercentageFeatureSet(pattern ="^RP[SL]", col.name ="percent_ribo") %>%NormalizeData(verbose =FALSE) %>%FindVariableFeatures(nfeatures =3000, verbose =FALSE) %>%ScaleData(verbose =FALSE) %>%RunPCA(npcs =30, approx =TRUE, seed.use =312, verbose =FALSE) %>%CellCycleScoring(s.features = cc.genes.updated.2019$s.genes, g2m.features = cc.genes.updated.2019$g2m.genes, set.ident =FALSE) %>%RunUMAP(reduction ="pca", dims =1:30,n.neighbors =20, n.components =2, metric ="cosine", n.epochs =750,seed.use =312, verbose =FALSE) %>%FindNeighbors(reduction ="pca", k.param =20,nn.method ="annoy", annoy.metric ="cosine", verbose =FALSE) %>%FindClusters(algorithm =4, method ="igraph", resolution =0.5, random.seed =312, verbose =FALSE)```Here we examine some of our dataset's attributes - the original celltype annotations, our new clusters, age in post-conception weeks (PCW), and the proportion of each cell's reads aligning to ribosomal genes. ```{r}#| message: false#| warning: false#| code-fold: true#| fig-width: 10#| fig-height: 6#| fig-cap: Characteristics of the cortical development dataset#| label: fig-EDAp1 <-DimPlot(seu_cortex, group.by ="celltype_original",pt.size =1.25, alpha =0.75) +scale_color_manual(values = palette_celltype) +theme_scLANE(umap =TRUE) +labs(color ="Celltype") +theme(axis.title =element_blank(), plot.title =element_blank()) +guide_umap()p2 <-FeaturePlot(seu_cortex, features ="Age_in_Weeks", pt.size =1.25, alpha =0.75) +scale_color_gradientn(colors = palette_heatmap) +labs(color ="Age (weeks)") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank())p3 <-DimPlot(seu_cortex, group.by ="seurat_clusters", pt.size =1.25, alpha =0.75) +scale_color_manual(values = palette_cluster) +labs(color ="Leiden Cluster") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank()) +guide_umap()p4 <-FeaturePlot(seu_cortex, features ="percent_ribo", pt.size =1.25, alpha =0.75) +scale_color_gradientn(colors = palette_heatmap, labels = scales::label_percent(accuracy =1, scale =1)) +labs(color ="Ribosomal\nReads") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank())p5 <- (p1 / p2) | (p3 / p4)p5 <- ggpubr::annotate_figure(ggpubr::ggarrange(p5), bottom ="UMAP 1",left ="UMAP 2")p5```# Analysis {#sec-analysis}## Celltype annotationWe begin by running a basic differential expression (DE) testing routine to identify some potential marker genes for each cluster. This will help us assign a celltype identity to each cluster. ```{r}cluster_markers <-FindAllMarkers(seu_cortex, assay ="RNA", logfc.threshold =0.25, test.use ="wilcox", slot ="data", min.pct =0.1, only.pos =TRUE, verbose =FALSE, random.seed =312)top5_cluster_markers <-arrange(cluster_markers, p_val_adj) %>%with_groups(cluster, slice_head, n =5)```We visualize the markers for each cluster with a dotplot, which shows clear differences between clusters. For example, the transcription factor (TF) aristaless related homeobox (*ARX*) is specifically expressed in cluster 7; this TF's expression is known to be necessary for healthy neuronal development and helps to regulate differentiation. ```{r}#| code-fold: true#| fig-width: 5#| fig-height: 7#| fig-cap: Putative marker genes for each cluster#| label: fig-dotplot_clusterp6 <-DotPlot(seu_cortex, features =unique(top5_cluster_markers$gene),assay ="RNA", group.by ="seurat_clusters", dot.scale =5,cols = paletteer::paletteer_d("wesanderson::Zissou1")[c(1, 5)], scale.by ="radius") +coord_flip() +labs(y ="Leiden Cluster") +theme_scLANE() +theme(axis.title.y =element_blank(), axis.text.y =element_text(face ="italic"))p6```Now we identify markers for each celltype:```{r}Idents(seu_cortex) <-"celltype_original"celltype_markers <-FindAllMarkers(seu_cortex, assay ="RNA", logfc.threshold =0.25, test.use ="wilcox", slot ="data", min.pct =0.1, only.pos =TRUE, verbose =FALSE, random.seed =312)top3_celltype_markers <-arrange(celltype_markers, p_val_adj) %>%with_groups(cluster, slice_head, n =3)```The markers seem consistent with what we'd expect for excitatory neurons; for more information see e.g., Figure 4 and Suppl. Figures 4 & 11 in [the original publication](https://doi.org/10.1126/science.aap8809). ```{r}#| code-fold: true#| tbl-cap: The top 3 markers genes for each identified celltype#| label: tbl-celltype_markersselect(top3_celltype_markers, celltype = cluster, gene, avg_log2FC, p_val_adj) %>% kableExtra::kbl(digits =3, booktabs =TRUE, col.names =c("Celltype", "Gene", "Mean log2FC", "Adj. P-value")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```## Pseudotime estimationAfter estimating principal curves across our UMAP embedding with `slingshot`, we generate per-lineage pseudotime values for each cell. We use our clustering as a source of structure in the data, and assign start and end clusters based on age in PCW. In addition, we mean-aggregate the pseudotime values for each cell in order to get a total pseudotime which represents overall differentiation regardless of lineage. ```{r}sling_res <-slingshot(Embeddings(seu_cortex, "umap"), clusterLabels = seu_cortex$seurat_clusters, start.clus =c("3"), end.clus =c("5", "6", "7"), approx_points =1e3)sling_curves <-slingCurves(sling_res, as.df =TRUE)sling_mst <-slingMST(sling_res, as.df =TRUE)sling_pt <-as.data.frame(slingPseudotime(sling_res)) %>%rowwise() %>%mutate(PT_Overall =mean(c_across(starts_with("Lineage")), na.rm =TRUE)) %>%ungroup() %>%mutate(across(c(starts_with("Lineage"), PT_Overall), \(x) (x -min(x, na.rm =TRUE)) / (max(x, na.rm =TRUE) -min(x, na.rm =TRUE))))seu_cortex <-AddMetaData(seu_cortex, metadata = sling_pt$PT_Overall, col.name ="PT_Overall")```The graph structure estimate by `slingshot` is called a minimum spanning tree (MST), and describes the relationships between clusters in an undirected manner. ```{r}#| code-fold: true#| fig-width: 6#| fig-height: 4#| fig-cap: UMAP embedding with MST from Slinghot overlaid#| label: fig-umap_MSTp7 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, group = Lineage), linewidth =1.25, color ="black") +geom_point(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, fill = Cluster), color ="black", shape =21, size =4.5, stroke =1.25, show.legend =FALSE) +scale_color_manual(values = palette_cluster) +scale_fill_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p7```Since our data have real-life experimental timepoints, it makes sense to normalize the pseudotime within each lineage to $[0, 1]$.```{r}#| code-fold: true#| fig-width: 6#| fig-height: 6#| fig-cap: UMAP embedding colored by per-lineage pseudotime#| label: fig-umap_lineage_PTp8 <-data.frame(Embeddings(seu_cortex, "umap")) %>%bind_cols(sling_pt) %>% tidyr::pivot_longer(starts_with("Lineage"), names_to ="lineage", values_to ="pseudotime") %>%ggplot(aes(x = umap_1, y = umap_2, color = pseudotime)) +facet_wrap(~lineage, nrow =3) +geom_point(size =1.25, alpha =0.75) +labs(x ="UMAP 1", y ="UMAP 2", color ="Pseudotime") +scale_color_gradientn(colors = palette_heatmap, labels = scales::label_number(accuracy = .01)) +theme_scLANE(umap =TRUE)p8```The principal curves show a trifurcating lineage structure. ```{r}#| code-fold: true#| fig-width: 6#| fig-height: 4#| fig-cap: UMAP embedding with principal curves from Slinghot overlaid in black#| label: fig-umap_curvesp9 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_curves,mapping =aes(x = umap_1, y = umap_2, group = Lineage), color ="black", linewidth =1.5, alpha =0.75, lineend ="round") +scale_color_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guides(color =guide_legend(override.aes =list(size =4, alpha =1, stroke =0.25)))p9```Lastly, we plot the distribution of mean-aggregated pseudotime per-timepoint and see that later timepoints have, in general, larger values of overall pseudotime. This indicates that our estimated pseudotime is linked to real experimental time in a meaningful (though fairly noisy) way. ```{r}#| code-fold: true#| fig-width: 8#| fig-height: 5#| fig-cap: Beeswarm plot displaying the distribution of mean-aggregated pseudotime for each timepoint#| label: fig-PCW_PTp10 <-select(seu_cortex@meta.data, Age_in_Weeks, PT_Overall) %>%mutate(Age_in_Weeks =as.factor(Age_in_Weeks)) %>%ggplot(aes(x = Age_in_Weeks, y = PT_Overall, color = Age_in_Weeks)) + ggbeeswarm::geom_quasirandom(size =1.25, alpha =0.75, show.legend =FALSE) +stat_summary(geom ="point", fun ="mean", color ="black",size =3) +scale_color_manual(values = palette_timepoint) +labs(x ="Age (PCW)", y ="Mean Psueodtime") +theme_scLANE()p10```## Trajectory DE testingUsing `scLANE` (see [the GitHub repository](https://github.com/jr-leary7/scLANE) or [the preprint](https://doi.org/10.1101/2023.12.19.572477) for details) we perform trajectory DE significance testing for each HVG across each lineage. ```{r}top3k_hvg <-HVFInfo(seu_cortex) %>%arrange(desc(variance.standardized)) %>%slice_head(n =3000) %>%rownames(.)cell_offset <-createCellOffset(seu_cortex)pt_df <-select(sling_pt, -PT_Overall)scLANE_models <-testDynamic(seu_cortex, pt = pt_df, genes = top3k_hvg,size.factor.offset = cell_offset, n.cores =6L, verbose =FALSE)scLANE_de_res <-getResultsDE(scLANE_models)```We now have lineage-specific trajectory DE statistics for each gene: ```{r}#| code-fold: true#| tbl-cap: Top-10 most DE genes from scLANE#| label: tbl-scLANE_outputselect(scLANE_de_res, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Lineage, Gene_Dynamic_Overall) %>%mutate(Gene_Dynamic_Lineage =if_else(Gene_Dynamic_Lineage ==1, "Dynamic", "Static"), Gene_Dynamic_Overall =if_else(Gene_Dynamic_Overall ==1, "Dynamic", "Static")) %>%slice_head(n =10) %>% kableExtra::kbl(digits =4, booktabs =TRUE, col.names =c("Gene", "Lineage", "LRT stat.", "P-value", "Adj. p-value", "Gene status (lineage)", "Gene status (overall)")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We identify a set of dynamic genes - dynamic meaning DE over any of the three lineages - to be used for downstream analysis. ```{r}dyn_genes <-filter(scLANE_de_res, Gene_Dynamic_Overall ==1) %>%distinct(Gene) %>%pull(Gene)```Lastly, we pull a table of gene dynamics for each lineage. ```{r}smoothed_counts <-smoothedCountsMatrix(scLANE_models, size.factor.offset = cell_offset, pt = pt_df, genes = dyn_genes, log1p.norm =TRUE, n.cores =4L)```## Constructing a GRN from scratchBuilding a GRN requires two main inputs - a set of transcription factors, and a set of potential target genes. ```{r}#| message: false#| warning: falsehs_ensembl <-useMart("ensembl", dataset ="hsapiens_gene_ensembl")hs_tf_raw <- readr::read_csv("http://humantfs.ccbr.utoronto.ca/download/v_1.01/DatabaseExtract_v_1.01.csv",col_select =-1,num_threads =2L,show_col_types =FALSE) %>% janitor::clean_names() %>%filter(is_tf =="Yes") %>%select(ensembl_id)hs_tfs <-getBM(attributes =c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id", "description", "gene_biotype"),filters ="ensembl_gene_id",values = hs_tf_raw$ensembl_id,mart = hs_ensembl,uniqueRows =TRUE) %>%rename(ensembl_id = ensembl_gene_id,entrez_id = entrezgene_id) %>%arrange(ensembl_id) %>%mutate(hgnc_symbol =if_else(hgnc_symbol =="", NA_character_, hgnc_symbol),description =gsub("\\[Source.*", "", description))```Along with the HGNC symbol, we used `BiomaRt` to retrieve the Ensembl & Entrez IDs and a description for every TF. ```{r}#| code-fold: true#| tbl-cap: Human transcription factors#| label: tbl-TFsslice_sample(hs_tfs, n =10) %>% kableExtra::kbl(booktabs =TRUE, col.names =c("Ensembl ID", "HGNC symbol", "Entrez ID", "Description", "Biotype")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We split our set of `r length(dyn_genes)` trajectory DE genes into the two categories using the existing set of human TFs I found online. ```{r}valid_hs_tfs <- hs_tfs$hgnc_symbol[hs_tfs$hgnc_symbol %in% dyn_genes]target_genes <- dyn_genes[!dyn_genes %in% valid_hs_tfs]```The GRN has the following structure - for every target gene, a penalized model is fit with the dynamics of every possible TF as features (see below). The model's internal feature selection identifies a subset of TFs whose dynamics are predictive of the target gene's dynamics. See e.g., [the original XGBoost paper](https://doi.org/10.48550/arXiv.1603.02754) for details. $$\text{Dynamics}_{\text{Target}} = \begin{bmatrix} \text{Dynamics}_{\text{TF}_1} & \text{Dynamics}_{\text{TF}_2} & \dots & \text{Dynamics}_{\text{TF}_N} \end{bmatrix}$$ {#eq-XGBoost}Our GRNs will be lineage-specific, since the matrix of gene dynamics differs by lineage. First we fit the GRN for lineage A; I've opted to use the `LightGBM` package to fit the models as it's lightweight and in general much faster than other options such as `XGBoost`. In addition, I prefer this method to alternative penalized methods such as LASSO - boosted trees better capture nonlinearities in the data, though LASSO does offer advantages such as significance testing for coefficients and linear interpretability. Both methods have built-in cross-validation (CV) routines (`lightgbm::lgb.cv()` and `glmnet::cv.glmnet()`, respectively); here we use 5-fold CV to select the best set of hyperparameters before training the final model. ::: {.callout-note}Since we're using the `foreach` and `doSNOW` packages to loop over the set of target genes in parallel, we need to ensure that our `LightGBM` model only uses a single thread when fitting the model. This is done by setting the parameters `tree_learner = "serial"` and `num_threads = 1L` in the model parameters list. :::```{r}#| message: false#| warning: falsecl <- parallel::makeCluster(6L)doSNOW::registerDoSNOW(cl)lgbm_params <-list(objective ="gamma", tree_learner ="serial", metric ="l2", boosting_type ="gbdt", num_threads =1L, seed =312)grn_res_linA <-foreach(g =seq(target_genes), .combine ="list",.multicombine =ifelse(length(target_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(target_genes) >1, length(target_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_A[, valid_hs_tfs] resp_var <- smoothed_counts$Lineage_A[, target_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="A", Target_Gene = target_genes[g], .before =1) imp_table }names(grn_res_linA) <- target_genesgrn_table_linA <- purrr::imap(grn_res_linA, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="A", Target_Gene = y, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})```We coerce the model output into a tidy table. ```{r}grn_table_linA <- purrr::reduce(grn_table_linA, rbind) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency))```Next we add the Spearman correlation between TF and target gene dynamics, as well as the normalized gene expression correlation. We denote the relationship between TF and target gene as repression if the correlation between their dynamics is negative, and activation if the correlation is positive. ::: {.callout-warning}Usually the correlations between TF and target dynamics and TF and target expression are similar, or at least have the same directionality (positive or negative). However, when expression patterns are noisy, the correlation directions might differ (see @tbl-GRN_linA for examples). In this case, we're opting to trust the dynamics more, as they are a smoothed version of expression - though an argument could be made for the opposite stance as well. :::```{r}dyn_cormat <-cor(smoothed_counts$Lineage_A, method ="spearman")exp_cormat <-cor(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, ])), method ="spearman")grn_table_linA <-mutate(grn_table_linA, Dyn_Cor =NA_real_, Exp_Cor =NA_real_)for (i inseq(nrow(grn_table_linA))) { grn_table_linA$Dyn_Cor[i] <- dyn_cormat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]] grn_table_linA$Exp_Cor[i] <- exp_cormat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]]}grn_table_linA <-mutate(grn_table_linA, State =if_else(Dyn_Cor <0, "Repression", "Activation"))```The output is shown below. Here we use the frequency with which the TF is included in the final boosted tree model as the feature importance, which can be interpreted as the strength of the predictive relationship between TF & target gene. See e.g., [this StackExchange post](https://datascience.stackexchange.com/questions/12318/how-to-interpret-the-output-of-xgboost-importance) for more detail. ```{r}#| code-fold: true#| tbl-cap: GRN for dynamic genes across Lineage A#| label: tbl-GRN_linAselect(grn_table_linA, Lineage, Tx_Factor, Target_Gene, Frequency, Dyn_Cor, Exp_Cor, State) %>%slice_sample(n =10) %>% kableExtra::kbl(digits =4, booktabs =TRUE, col.names =c("Lineage", "TF", "Target", "Importance", "Dynamics cor.", "Expression cor.", "State")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We repeat the fitting process for the other two lineages (expand this code block to see the details). Skip to @sec-conclusions if you want a version of this logic that's been wrapped into a function!```{r}#| code-fold: true#| message: false#| warning: falsegrn_res_linB <-foreach(g =seq(target_genes), .combine ="list",.multicombine =ifelse(length(target_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(target_genes) >1, length(target_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_B[, valid_hs_tfs] resp_var <- smoothed_counts$Lineage_B[, target_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="B", Target_Gene = target_genes[g], .before =1) imp_table }names(grn_res_linB) <- target_genesgrn_table_linB <- purrr::imap(grn_res_linB, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="B", Target_Gene = y, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})grn_table_linB <- purrr::reduce(grn_table_linB, rbind) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency))dyn_cormat <-cor(smoothed_counts$Lineage_B, method ="spearman")grn_table_linB <-mutate(grn_table_linB, Dyn_Cor =NA_real_, Exp_Cor =NA_real_)for (i inseq(nrow(grn_table_linB))) { grn_table_linB$Dyn_Cor[i] <- dyn_cormat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]] grn_table_linB$Exp_Cor[i] <- exp_cormat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]]}grn_table_linB <-mutate(grn_table_linB, State =if_else(Dyn_Cor <0, "Repression", "Activation"))grn_res_linC <-foreach(g =seq(target_genes), .combine ="list",.multicombine =ifelse(length(target_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(target_genes) >1, length(target_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_C[, valid_hs_tfs] resp_var <- smoothed_counts$Lineage_C[, target_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="C", Target_Gene = target_genes[g], .before =1) imp_table }names(grn_res_linC) <- target_genesgrn_table_linC <- purrr::imap(grn_res_linC, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="C", Target_Gene = y, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})grn_table_linC <- purrr::reduce(grn_table_linC, rbind) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency))dyn_cormat <-cor(smoothed_counts$Lineage_C, method ="spearman")grn_table_linC <-mutate(grn_table_linC, Dyn_Cor =NA_real_, Exp_Cor =NA_real_)for (i inseq(nrow(grn_table_linC))) { grn_table_linC$Dyn_Cor[i] <- dyn_cormat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]] grn_table_linC$Exp_Cor[i] <- exp_cormat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]]}grn_table_linC <-mutate(grn_table_linC, State =if_else(Dyn_Cor <0, "Repression", "Activation"))parallel::stopCluster(cl)```We collate all three GRNs into a single overall table:```{r}grn_table_all <-bind_rows(grn_table_linA, grn_table_linB, grn_table_linC) %>%arrange(Tx_Factor, Target_Gene)```For each lineage we create a directed, weighted graph of the relationships between TFs and target genes using the `igraph` package. ```{r}graph_df_linA <-select(grn_table_linA, Tx_Factor, Target_Gene, Frequency, State) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, slice_head, n =10)graph_obj_linA <-graph.data.frame(graph_df_linA, directed =TRUE)graph_df_linB <-select(grn_table_linB, Tx_Factor, Target_Gene, Frequency, State) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, slice_head, n =10)graph_obj_linB <-graph.data.frame(graph_df_linB, directed =TRUE)graph_df_linC <-select(grn_table_linC, Tx_Factor, Target_Gene, Frequency, State) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, slice_head, n =10)graph_obj_linC <-graph.data.frame(graph_df_linC, directed =TRUE)```One TF of particular interest is SRY-box transcription factor 4 (*SOX4*). This TF helps to regulate development and is involved in the specification of cell fate, so a comparison of its behavior across lineages should be pretty neat. We use a Fruchterman-Reingold layout to embed the graphs in 2D space via the `layout_with_fr()` function from `igraph`. We see that not only do the top target genes differ widely by lineage, so too do the directions of the relationships (activation vs. repression), and the strengths of those relationships. For example, in lineage A the top target gene is `r filter(graph_df_linA, Tx_Factor == "SOX4") %>% slice_head(n = 1) %>% pull(Target_Gene)`, whereas for lineage B it's `r filter(graph_df_linB, Tx_Factor == "SOX4") %>% slice_head(n = 1) %>% pull(Target_Gene)`. ```{r}#| code-fold: true#| fig-width: 10#| fig-height: 15#| fig-cap: Fruchterman-Reingold embeddings of the relationships between SRY-box transcription factor 4 and lineage-specific target genes#| label: fig-FR_embedset.seed(1)freqs <- purrr::map(list(graph_df_linA, graph_df_linB, graph_df_linC), \(x) {filter(x, Tx_Factor =="SOX4") %>%pull(Frequency)})freq_min <-min(purrr::reduce(freqs, c))freq_max <-max(purrr::reduce(freqs, c))p11 <-fortify(graph_obj_linA, layout =layout_with_fr(graph = graph_obj_linA, grid ="nogrid", weights =E(graph_obj_linA)$weights^3),arrow.gap = .02) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%filter(name %in% (filter(graph_df_linA, Tx_Factor =="SOX4") %>%pull(Target_Gene)) | name =="SOX4") %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage A") +geom_edges(aes(linewidth = Frequency, alpha = Frequency, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 1.5), limits =c(freq_min, freq_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(freq_min, freq_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p12 <-fortify(graph_obj_linB, layout =layout_with_fr(graph = graph_obj_linB, grid ="nogrid", weights =E(graph_obj_linB)$weights^3),arrow.gap = .02) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%filter(name %in% (filter(graph_df_linB, Tx_Factor =="SOX4") %>%pull(Target_Gene)) | name =="SOX4") %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage B") +geom_edges(aes(linewidth = Frequency, alpha = Frequency, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 1.5), limits =c(freq_min, freq_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(freq_min, freq_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p13 <-fortify(graph_obj_linC, layout =layout_with_fr(graph = graph_obj_linC, grid ="nogrid", weights =E(graph_obj_linC)$weights^3),arrow.gap = .02) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%filter(name %in% (filter(graph_df_linC, Tx_Factor =="SOX4") %>%pull(Target_Gene)) | name =="SOX4") %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage C") +geom_edges(aes(linewidth = Frequency, alpha = Frequency, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 1.5), limits =c(freq_min, freq_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(freq_min, freq_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p14 <- (p11 / p12 / p13) +plot_layout(guides ="collect")p14```# Conclusions {#sec-conclusions}Hopefully this analysis has demonstrated the utility of trajectory GRNs for comparing gene dynamics across lineages. Beyond just visual comparison of how gene expression changes over pseudotime, trajectory GRNs allow us to identify which transcription factors are activating or suppressing target genes, how top TF targets differ between lineages, and the ways in which TF-target relationships change based on the cell fate towards which cells are progressing. To conclude, I've written up the methodology for creating a trajectory GRN into a function, which is shown in the code block below. It includes functionality for a progress bar if desired, and takes as input a matrix of gene dynamics (`expr.mat`), a set of dynamic genes (`dyn.genes`), and a set of transcription factors (`tx.factors`). Eventually I plan to expand this functionality and turn it into an R package, but for now it'll stay a simple function. ```{r}#| code-fold: trueestimateTGRN <-function(expr.mat =NULL,dyn.genes =NULL, tx.factors =NULL, log1p.norm =TRUE,cor.method ="spearman",n.cores =4L, verbose =TRUE, random.seed =312) {# check inputs if (is.null(expr.mat) ||is.null(dyn.genes) ||is.null(tx.factors)) { stop("Arguments to estimateTGRN() are missing.") }# ID non-TF genes & filter out non-dynamic TFs target_genes <- dyn.genes[!dyn.genes %in% tx.factors] tx.factors <- tx.factors[tx.factors %in% dyn.genes]# normalize matrix of dynamicsif (log1p.norm) { expr.mat <-log1p(expr.mat) }# set up progress barif (verbose) { withr::with_output_sink(tempfile(), { pb <- utils::txtProgressBar(0, length(target_genes), style =3) }) progress_fun <-function(n) utils::setTxtProgressBar(pb, n) snow_opts <-list(progress = progress_fun) } else { snow_opts <-list() }# set up parallel processingif (n.cores >1L) { cl <- parallel::makeCluster(6L) doSNOW::registerDoSNOW(cl) } else { cl <- foreach::registerDoSEQ() }# set up LightGBM model settings lgbm_params <-list(objective ="gamma", tree_learner ="serial", metric ="l2", boosting_type ="gbdt", num_threads =1L, seed = random.seed)# estimate trajectory GRN grn_res <- foreach::foreach(g =seq(target_genes), .combine ="list",.multicombine =ifelse(length(target_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(target_genes) >1, length(target_genes), 2),.packages =c("lightgbm", "dplyr"),.export =c("expr.mat", "tx.factors", "target_genes", "lgbm_params"), .errorhandling ="pass",.inorder =TRUE,.verbose =FALSE,.options.snow = snow_opts) %dopar% { feature_mat <- expr.mat[, tx.factors] resp_var <- expr.mat[, target_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Target_Gene = target_genes[g], .before =1) imp_table }names(grn_res) <- target_genesif (n.cores >1L) { parallel::stopCluster(cl) }# format GRN table grn_table <- purrr::imap(grn_res, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Target_Gene = y, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) } }) grn_table <- purrr::reduce(grn_table, rbind) %>% dplyr::rename(Tx_Factor = Feature) %>% dplyr::filter(!is.na(Target_Gene), !is.na(Tx_Factor))# add correlation b/t TF and each target gene dyn_cormat <- stats::cor(expr.mat[, dyn.genes], method = cor.method) grn_table <- dplyr::mutate(grn_table, Dyn_Cor =NA_real_)for (i inseq(nrow(grn_table))) { grn_table$Dyn_Cor[i] <- dyn_cormat[grn_table$Target_Gene[i], grn_table$Tx_Factor[i]] } grn_table <- dplyr::mutate(grn_table, State = dplyr::if_else(Dyn_Cor <0, "Repression", "Activation"))return(grn_table)}```# Session info {#sec-SI}```{r}sessioninfo::session_info()```